Abstract. We propose a general approach to computing seed sensitivity that
can be applied to different definitions of seeds. It treats separately the three
components of the seed sensitivity problem - a set of target alignments, an
associated probability distribution, and a seed model - which are specified by
distinct finite automata. The approach is then applied to a new concept of
subset seeds, for which we propose an efficient automaton construction.
Experimental results confirm that sensitive subset seeds can be efficiently
designed using our approach, and can then be used in similarity search,
producing better results than ordinary spaced seeds.

1 Introduction 

In the framework of pattern matching and similarity search in biological
sequences, seeds specify a class of short sequence motifs which, if shared by two
sequences, are assumed to witness a potential similarity. Spaced seeds were
introduced several years ago [1,2] and have been shown to improve significantly
the efficiency of the search. One of the key problems associated with spaced seeds
is a precise estimation of the sensitivity of the associated search method. This
is important for comparing seeds and for choosing the most appropriate seeds for
a given sequence comparison problem.

The problem of seed sensitivity depends on several components. First, it
depends on the seed model specifying the class of allowed seeds and the way that
seeds match (hit) potential alignments. In the basic case, seeds are specified by
binary words of a certain length (span), possibly with a constraint on the number
of 1's (weight). However, different extensions of this basic seed model have been
proposed in the literature, such as multi-seed (or multi-hit) strategies [3,4,2],
seed families [5,6,7,8,9,10], seeds over non-binary alphabets [11,12], and vector
seeds [13,10].

The second parameter is the class of target alignments, that is, the alignment
fragments that one aims to detect. Usually, these are gapless alignments of a
given length. Gapless alignments are easy to model: in the simplest case they
are represented by binary sequences in the match/mismatch alphabet. This
representation has been adopted by many authors [2,14,15,16,17,18]. The binary
representation, however, cannot distinguish between different types of matches
and mismatches, and is clearly insufficient in the case of protein sequences. In
[13,10], an alignment is represented by a sequence of real numbers that are
scores of matches or mismatches at corresponding positions. A related, yet
different approach is suggested in [12], where DNA alignments are represented
by sequences on the ternary alphabet of match/transition/transversion. Finally,
another generalization of simple binary sequences was considered in [19], where
alignments are required to be homogeneous, i.e. to contain no sub-alignment with
a score larger than the entire alignment.

The third necessary ingredient for seed sensitivity estimation is the proba- 
bility distribution on the set of target alignments. Again, in the simplest case, 
alignment sequences are assumed to obey a Bernoulli model [2,16]. In more 
general settings, Markov or Hidden Markov models are considered [17,15]. A 
different way of defining probabilities on binary alignments has been taken in 
[19]: all homogeneous alignments of a given length are considered equiprobable. 

Several algorithms for computing the seed sensitivity for different frameworks 
have been proposed in the above-mentioned papers. All of them, however, use a 
common dynamic programming (DP) approach, first brought up in [14]. 

In the present paper, we propose a general approach to computing the seed
sensitivity. This approach subsumes the cases considered in the above-mentioned
papers, and makes it possible to deal with new combinations of the three seed
sensitivity parameters. The underlying idea of our approach is to specify each of
the three components - the seed, the set of target alignments, and the probability
distribution - by a separate finite automaton.

A deterministic finite automaton (DFA) that recognizes all alignments matched
by given seeds was already used in [17] for the case of ordinary spaced seeds. In
this paper, we assume that the set of target alignments is also specified by a DFA
and, more importantly, that the probabilistic model is specified by a probability
transducer - a probability-generating finite automaton equivalent to an HMM with
respect to the class of generated probability distributions.

We show that once these three automata are set, the seed sensitivity can be 
computed by a unique general algorithm. This algorithm reduces the problem to 
a computation of the total weight over all paths in an acyclic graph corresponding 
to the automaton resulting from the product of the three automata. This com- 
putation can be done by a well-known dynamic programming algorithm [20,21] 
with the time complexity proportional to the number of transitions of the re- 
sulting automaton. Interestingly, all above-mentioned seed sensitivity algorithms 
considered by different authors can be reformulated as instances of this general 
algorithm. 

In the second part of this work, we study a new concept of subset seeds -
an extension of spaced seeds that makes it possible to deal with a non-binary
alignment alphabet and, on the other hand, still allows an efficient hashing
method to locate seeds. For this definition of seeds, we define a DFA with a
number of states independent of the size of the alignment alphabet. Reduced to
the case of ordinary spaced seeds, this DFA construction gives the same worst-case
number of states as the Aho-Corasick DFA used in [17]. Moreover, our DFA never has
more states than the DFA of [17], and has substantially fewer states on average.

Together with the general approach proposed in the first part, our DFA
gives an efficient algorithm for computing the sensitivity of subset seeds, for
different classes of target alignments and different probability transducers. In the
experimental part of this work, we confirm this by running an implementation of
our algorithm in order to design efficient subset seeds for different probabilistic
models, trained on real genomic data. We also show experimentally that the designed
subset seeds find more significant alignments than ordinary spaced seeds
of equivalent selectivity.

2 General Framework 

Estimating the seed sensitivity amounts to computing the probability for a random
word (target alignment), drawn according to a given probabilistic model, to
belong to a given language, namely the language of all alignments matched by
a given seed (or a set of seeds).

2.1 Target Alignments

Target alignments are represented by words over an alignment alphabet $A$.
In the simplest case, considered most often, the alphabet is binary and
expresses a match or a mismatch occurring at each alignment column. However,
it could be useful to consider larger alphabets, such as the ternary alphabet of
match/transition/transversion for the case of DNA (see [12]). The importance
of this extension is even more evident for the protein case ([10]), where different
types of amino acid pairs are generally distinguished.

Usually, the set of target alignments is a finite set. In the case considered
most often [2,14,15,16,17,18], target alignments are all words of a given length
$n$. This set is trivially a regular language that can be specified by a
deterministic automaton with $(n + 1)$ states. However, more complex definitions of
target alignments have been considered (see e.g. [19]) that aim to capture more
adequately properties of biologically relevant alignments. In general, we assume
that the set of target alignments is a finite regular language $L_T \subseteq A^*$ and thus
can be represented by an acyclic DFA $T = \langle Q_T, q_T^0, q_T^F, A, \psi_T \rangle$.

2.2 Probability Assignment 

Once an alignment language $L_T$ has been set, we have to define a probability
distribution on the words of $L_T$. We do this using probability transducers.

A probability transducer is a finite automaton without final states in which 
each transition outputs a probability. 
Definition 1. A probability transducer $G$ over an alphabet $A$ is a 4-tuple
$\langle Q_G, q_G^0, A, \rho_G \rangle$, where $Q_G$ is a finite set of states, $q_G^0 \in Q_G$ is an initial state,
and $\rho_G : Q_G \times A \times Q_G \to [0, 1]$ is a real-valued probability function such that

$\forall q \in Q_G, \ \sum_{q' \in Q_G,\, a \in A} \rho_G(q, a, q') = 1.$

A transition of $G$ is a triplet $e = \langle q, a, q' \rangle$ such that $\rho_G(q, a, q') > 0$. Letter
$a$ is called the label of $e$ and denoted $label(e)$. A probability transducer $G$ is
deterministic if for each $q \in Q_G$ and each $a \in A$, there is at most one transition
$\langle q, a, q' \rangle$. For each path $P = (e_1, \ldots, e_n)$ in $G$, we define its label to be the word
$label(P) = label(e_1) \ldots label(e_n)$, and the associated probability to be the product
$\rho(P) = \prod_{i=1}^{n} \rho_G(e_i)$. A path is initial if its start state is the initial state of
the transducer $G$.

Definition 2. The probability of a word $w \in A^*$ according to a probability
transducer $G = \langle Q_G, q_G^0, A, \rho_G \rangle$, denoted $\mathcal{P}_G(w)$, is the sum of probabilities
of all initial paths in $G$ with the label $w$. $\mathcal{P}_G(w) = 0$ if no such path exists. The
probability $\mathcal{P}_G(L)$ of a finite language $L \subseteq A^*$ according to a probability transducer
$G$ is defined by $\mathcal{P}_G(L) = \sum_{w \in L} \mathcal{P}_G(w)$.

Note that for any $n$ and for $L = A^n$ (all words of length $n$), $\mathcal{P}_G(L) = 1$.
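Definitions 1 and 2 can be illustrated with a minimal Python sketch. The two-state transducer `RHO`, its state names and its probabilities are hypothetical values chosen only for illustration; the function names are likewise assumptions and do not come from the paper. The sketch sums the probabilities of all initial paths labelled by a word, and checks numerically that the words of a fixed length have total probability 1.

```python
from itertools import product

def word_probability(rho, q0, word):
    """Sum of the probabilities of all initial paths labelled `word`.

    `rho` maps (state, letter) to a list of (next_state, probability) pairs.
    """
    weights = {q0: 1.0}  # total weight of the paths reaching each state
    for letter in word:
        nxt = {}
        for state, weight in weights.items():
            for target, prob in rho.get((state, letter), []):
                nxt[target] = nxt.get(target, 0.0) + weight * prob
        weights = nxt
    return sum(weights.values())

def language_probability(rho, q0, words):
    """Probability of a finite language: sum over its words."""
    return sum(word_probability(rho, q0, w) for w in words)

# Hypothetical non-deterministic transducer over the binary alphabet {"1", "0"}.
# The outgoing probabilities of each state sum to 1, as Definition 1 requires.
RHO = {
    ("hi", "1"): [("hi", 0.7), ("lo", 0.1)],
    ("hi", "0"): [("hi", 0.1), ("lo", 0.1)],
    ("lo", "1"): [("lo", 0.3), ("hi", 0.2)],
    ("lo", "0"): [("lo", 0.4), ("hi", 0.1)],
}

total = language_probability(RHO, "hi", product("10", repeat=4))
```

Because the transducer has no final states, every prefix of a path is itself a path, which is why the weights of all reached states are summed at the end.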

Probability transducers can express common probability distributions on
words (alignments). Bernoulli sequences with independent probabilities of each
symbol [2,16,18] can be specified with deterministic one-state probability
transducers. In Markov sequences of order $k$ [17,6], the probability of each symbol
depends on the $k$ previous symbols. They can therefore be specified by a
deterministic probability transducer with at most $|A|^k$ states.

A Hidden Markov model (HMM) [15] corresponds, in general, to a
non-deterministic probability transducer. The states of this transducer correspond
to the (hidden) states of the HMM, plus possibly an additional initial state.
Conversely, for each probability transducer, one can construct an HMM generating
the same probability distribution on words. Therefore, non-deterministic
probability transducers and HMMs are equivalent with respect to the class of
generated probability distributions. The proofs are straightforward and are
omitted due to space limitations.

2.3 Seed automata and seed sensitivity 

Since the advent of spaced seeds [1,2], different extensions of this idea have
been proposed in the literature (see Introduction). For all of them, the set of
possible alignment fragments matched by a seed (or by a set of seeds) is a
finite set, and therefore the set of matched alignments is a regular language. For
the original spaced seed model, this observation was used by Buhler et al. [17]
who proposed an algorithm for computing the seed sensitivity based on a DFA
defining the language of alignments matched by the seed. In this paper, we
extend this approach to a general one that allows a uniform computation of seed
sensitivity for a wide class of settings including different probability distributions
on target alignments, as well as different seed definitions.
Consider a seed (or a set of seeds) $\pi$ under a given seed model. We assume
that the set of alignments $L_\pi$ matched by $\pi$ is a regular language recognized by
a DFA $S_\pi = \langle Q_S, q_S^0, Q_S^F, A, \psi_S \rangle$. Consider a finite set $L_T$ of target alignments
and a probability transducer $G$. Under these assumptions, the sensitivity of $\pi$ is
defined as the conditional probability

$\dfrac{\mathcal{P}_G(L_T \cap L_\pi)}{\mathcal{P}_G(L_T)}$    (1)

An automaton recognizing $L = L_T \cap L_\pi$ can be obtained as the product of
automata $T$ and $S_\pi$ recognizing $L_T$ and $L_\pi$ respectively. Let
$K = \langle Q_K, q_K^0, Q_K^F, A, \psi_K \rangle$ be this automaton. We now consider the product $W$ of $K$ and $G$,
denoted $K \times G$, defined as follows.

Definition 3. Given a DFA $K = \langle Q_K, q_K^0, Q_K^F, A, \psi_K \rangle$ and a probability
transducer $G = \langle Q_G, q_G^0, A, \rho_G \rangle$, the product of $K$ and $G$ is the
probability-weighted automaton $W = \langle Q_W, q_W^0, Q_W^F, A, \rho_W \rangle$ (for short, PW-automaton)
such that $Q_W = Q_K \times Q_G$, $q_W^0 = (q_K^0, q_G^0)$, $Q_W^F = \{(q_K, q_G) \mid q_K \in Q_K^F\}$, and
$\rho_W((q_K, q_G), a, (q'_K, q'_G)) = \rho_G(q_G, a, q'_G)$ if $\psi_K(q_K, a) = q'_K$, and $0$ otherwise.

$W$ can be viewed as a non-deterministic probability transducer with final states.
$\rho_W((q_K, q_G), a, (q'_K, q'_G))$ is the probability of the transition
$\langle (q_K, q_G), a, (q'_K, q'_G) \rangle$. A path in $W$ is called full if it goes from the initial state to a final state.

Lemma 1. Let $G$ be a probability transducer. Let $L$ be a finite language and $K$
be a deterministic automaton recognizing $L$. Let $W = K \times G$. The probability
$\mathcal{P}_G(L)$ is equal to the sum of probabilities of all full paths in $W$.

Proof. Since $K$ is a deterministic automaton, each word $w \in L$ corresponds to a
single accepting path in $K$, and the paths in $G$ labeled $w$ (see Definition 1) are in
one-to-one correspondence with the full paths in $W$ accepting $w$. By definition,
$\mathcal{P}_G(w)$ is equal to the sum of probabilities of all paths in $G$ labeled $w$. Each such
path corresponds to a unique path in $W$, with the same probability. Therefore,
the probability of $w$ is the sum of probabilities of the corresponding paths in $W$.
Each such path is a full path, and the sets of paths for distinct words $w$ are
disjoint. The lemma follows.

2.4 Computing Seed Sensitivity 

Lemma 1 reduces the computation of seed sensitivity to a computation of the
sum of probabilities of paths in a PW-automaton.

Lemma 2. Consider an alignment alphabet $A$, a finite set $L_T \subseteq A^*$ of target
alignments, and a set $L_\pi \subseteq A^*$ of all alignments matched by a given seed $\pi$.
Let $K = \langle Q_K, q_K^0, Q_K^F, A, \psi_K \rangle$ be an acyclic DFA recognizing the language
$L = L_T \cap L_\pi$. Let further $G = \langle Q_G, q_G^0, A, \rho \rangle$ be a probability transducer
defining a probability distribution on the set $L_T$. Then $\mathcal{P}_G(L)$ can be computed
in time $O(|Q_G|^2 \cdot |Q_K| \cdot |A|)$ and space $O(|Q_G| \cdot |Q_K|)$.
Proof. By Lemma 1, the probability of $L$ with respect to $G$ can be computed as
the sum of probabilities of all full paths in $W$. Since $K$ is an acyclic automaton,
so is $W$. Therefore, the sum of probabilities of all full paths in $W$ leading to final
states can be computed by a classical DP algorithm [20] applied to acyclic
directed graphs ([21] presents a survey of applications of this technique to different
bioinformatic problems). The time complexity of the algorithm is proportional
to the number of transitions in $W$. $W$ has $|Q_G| \cdot |Q_K|$ states, and for each letter
of $A$, each state has at most $|Q_G|$ outgoing transitions. The bounds follow.
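The computation described by Lemmas 1 and 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the automaton $K$ is built here by a naive sliding-window construction (it is neither the Aho-Corasick automaton nor the subset seed automaton), target alignments are all binary words of length `n`, seeds are written over the hypothetical symbols `#` (must match) and `_` (don't care), and all function names are invented for the example. The product with the transducer is traversed level by level, and the result is compared with a brute-force enumeration under a Bernoulli model.

```python
from itertools import product

ALPHABET = "10"  # "1" = match, "0" = mismatch

def seed_hits(seed, window):
    # "#" requires a match; "_" is a don't-care position
    return all(c == "_" or a == "1" for c, a in zip(seed, window))

def build_K(seed, n):
    """Naive acyclic DFA over words of length n; states are (length, window, hit)."""
    s = len(seed)
    start = (0, "", False)
    delta, level = {}, {start}
    for _ in range(n):
        nxt = set()
        for (i, window, hit) in level:
            for a in ALPHABET:
                cand = window + a
                hit2 = hit or (len(cand) == s and seed_hits(seed, cand))
                if len(cand) == s:
                    cand = cand[1:]  # keep only the last s-1 letters
                target = (i + 1, cand, hit2)
                delta[((i, window, hit), a)] = target
                nxt.add(target)
        level = nxt
    hit_finals = {q for q in level if q[2]}  # words of L_T matched by the seed
    all_finals = set(level)                  # all words of L_T
    return start, delta, hit_finals, all_finals

def language_probability(start, delta, finals, q0, rho):
    """Total weight of all full paths in the product of K and G."""
    weights = {(start, q0): 1.0}
    total = 0.0
    while weights:  # terminates because K is acyclic
        total += sum(w for (qk, _), w in weights.items() if qk in finals)
        nxt = {}
        for (qk, qg), w in weights.items():
            for a in ALPHABET:
                qk2 = delta.get((qk, a))
                if qk2 is None:
                    continue
                for qg2, p in rho.get((qg, a), []):
                    nxt[(qk2, qg2)] = nxt.get((qk2, qg2), 0.0) + w * p
        weights = nxt
    return total

def sensitivity(seed, n, q0, rho):
    start, delta, hit_finals, all_finals = build_K(seed, n)
    num = language_probability(start, delta, hit_finals, q0, rho)
    den = language_probability(start, delta, all_finals, q0, rho)
    return num / den

def brute_force(seed, n, p):
    total = 0.0
    for word in product(ALPHABET, repeat=n):
        w = "".join(word)
        if any(seed_hits(seed, w[i:i + len(seed)]) for i in range(n - len(seed) + 1)):
            total += p ** w.count("1") * (1 - p) ** w.count("0")
    return total

P_MATCH = 0.7  # hypothetical match probability
BERNOULLI = {("g", "1"): [("g", P_MATCH)], ("g", "0"): [("g", 1 - P_MATCH)]}
dp_value = sensitivity("#_#", 8, "g", BERNOULLI)
```

The same `language_probability` function accepts a non-deterministic transducer, since `rho` may list several targets per (state, letter) pair; only the brute-force check is specific to the Bernoulli case.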

Lemma 2 provides a general approach to computing the seed sensitivity. To
apply the approach, one has to define three automata:

— an acyclic DFA $T$ specifying a set of target alignments over
an alphabet $A$ (e.g. all words of a given length, possibly satisfying some
additional properties),

— a (generally non-deterministic) probability transducer $G$ specifying a
probability distribution on target alignments (e.g. Bernoulli model, Markov
sequence of order $k$, HMM),

— a DFA $S_\pi$ specifying the seed model via a set of matched
alignments.

As soon as these three automata are defined, Lemma 2 can be used to compute
probabilities $\mathcal{P}_G(L_T \cap L_\pi)$ and $\mathcal{P}_G(L_T)$ in order to estimate the seed sensitivity
according to equation (1).

Note that if the probability transducer $G$ is deterministic (as is the case
for Bernoulli models or Markov sequences), then the time complexity is
$O(|Q_G| \cdot |Q_K| \cdot |A|)$. In general, the complexity of the algorithm can be improved by
reducing the involved automata. Buhler et al. [17] introduced the idea of using the
Aho-Corasick automaton [22] as the seed automaton for a spaced seed. The
authors of [17] considered all binary alignments of a fixed length $n$ distributed
according to a Markov model of order $k$. In this setting, the obtained complexity
was $O(w 2^{s-w} 2^k n)$, where $s$ and $w$ are the seed's span and weight respectively.
Given that the size of the Aho-Corasick automaton is $O(w 2^{s-w})$, this complexity
is automatically implied by Lemma 2, as the size of the probability transducer is
$O(2^k)$, and that of the target alignment automaton is $O(n)$. Compared to [17],
our approach explicitly distinguishes the descriptions of matched alignments
and their probabilities, which allows us to automatically extend the algorithm
to more general cases.
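The way the bound of [17] follows from Lemma 2 can be spelled out as a short derivation. It assumes, as is standard for product automata, that the product $K$ of $T$ and the seed automaton has at most $|Q_T| \cdot |Q_S|$ states, and it uses the deterministic-transducer bound stated above.

```latex
\begin{align*}
|Q_K| &\le |Q_T| \cdot |Q_S| = O(n) \cdot O(w\,2^{s-w}), \qquad |Q_G| = O(2^k), \qquad |A| = 2, \\
O\bigl(|Q_G| \cdot |Q_K| \cdot |A|\bigr) &= O\bigl(2^k \cdot n \cdot w\,2^{s-w} \cdot 2\bigr) = O\bigl(w\,2^{s-w}\,2^k\,n\bigr).
\end{align*}
```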

Note that the idea of using the Aho-Corasick automaton can be applied to 
more general seed models than individual spaced seeds (e.g. to multiple spaced 
seeds, as pointed out in [17]). In fact, all currently proposed seed models can be 
described by a finite set of matched alignment fragments, for which the Aho- 
Corasick automaton can be constructed. We will use this remark in later sections. 

The sensitivity of a spaced seed with respect to an HMM-specified probability
distribution over binary target alignments of a given length $n$ was studied by
Brejova et al. [15]. The DP algorithm of [15] has a lot in common with the
algorithm implied by Lemma 2. In particular, the states of the algorithm of [15]
are triples $\langle w, q, m \rangle$, where $w$ is a prefix of the seed $\pi$, $q$ is a state of the HMM,
and $m \in [0..n]$. The states therefore correspond to the construction implied by
Lemma 2. However, the authors of [15] do not consider any automata, which
makes it impossible to optimize the preprocessing step (the counterpart of the
automaton construction) and, on the other hand, to extend the algorithm to
more general seed models and/or different sets of target alignments.

A key to an efficient solution of the sensitivity problem remains the definition 
of the seed. It should be expressive enough to be able to take into account 
properties of biological sequences. On the other hand, it should be simple enough 
to be able to locate seeds fast and to get an efficient algorithm for computing 
seed sensitivity. According to the approach presented in this section, the latter 
is directly related to the size of a DFA specifying the seed. 

3 Subset seeds 

3.1 Definition 

Ordinary spaced seeds use the simplest possible binary "match-mismatch"
alignment model that allows an efficient implementation by hashing all occurring
combinations of matching positions. A powerful generalization of spaced seeds,
called vector seeds, has been introduced in [13]. Vector seeds allow one to use
an arbitrary alignment alphabet and, on the other hand, provide a flexible
definition of a hit based on a cooperative contribution of seed positions. The much
higher expressiveness of vector seeds leads to more complicated algorithms and,
in particular, prevents the application of direct hashing methods at the seed
location stage.

In this section, we consider subset seeds, which have an intermediate
expressiveness between spaced and vector seeds. They allow an arbitrary alignment
alphabet and, on the other hand, still allow direct hashing for locating seeds,
which maps each string to a unique entry of the hash table. We also propose a
construction of a seed automaton for subset seeds, different from the Aho-Corasick
automaton. The automaton has $O(w 2^{s-w})$ states regardless of the size of the
alignment alphabet, where $s$ and $w$ are respectively the span of the seed and
the number of "must-match" positions. From the general algorithmic framework
presented in the previous section (Lemma 2), this implies that the seed
sensitivity can be computed for subset seeds with the same complexity as for
ordinary spaced seeds. Note also that for the binary alignment alphabet, this bound
is the same as the one implied by the Aho-Corasick automaton. However, for larger
alphabets, the Aho-Corasick construction leads to $O(w |A|^{s-w})$ states. In the
experimental part of this paper (section 4.1) we will show that even for the
binary alphabet, our automaton construction yields a smaller number of states in
practice.

Consider an alignment alphabet A. We always assume that A contains a 
symbol 1, interpreted as "match". A subset seed is defined as a word over a seed 
alphabet B, such that 



G. Kucherov, L. Noe, M. Roytberg 



— letters of $B$ denote subsets of the alignment alphabet $A$ containing 1
($B \subseteq \{1\} \cup 2^{A}$),

— $B$ contains a letter # that denotes subset $\{1\}$,

— a subset seed $b_1 b_2 \ldots b_m \in B^m$ matches an alignment fragment
$a_1 a_2 \ldots a_m \in A^m$ if $\forall i \in [1..m]$, $a_i \in b_i$.

The #-weight of a subset seed $\pi$ is the number of # in $\pi$ and the span of $\pi$ is
its length.

Example 1. [12] considered the alignment alphabet $A = \{1, h, 0\}$ representing
respectively a match, a transition mismatch, or a transversion mismatch in a
DNA sequence alignment. The seed alphabet is $B = \{\#, @, \_\}$ denoting
respectively subsets $\{1\}$, $\{1, h\}$, and $\{1, h, 0\}$. Thus, seed $\pi$ = #@_# matches
alignment $s$ = 10h1h1101 at positions 4 and 6. The span of $\pi$ is 4, and the
#-weight of $\pi$ is 2.

Note that unlike the weight of ordinary spaced seeds, the #-weight cannot serve
as a measure of seed selectivity. In the above example, symbol @ should be
assigned weight 0.5, so that the weight of $\pi$ is equal to 2.5 (see [12]).
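The matching rule and the two notions of weight can be checked on Example 1 with a short Python sketch. The function names and the dictionary encoding of the seed alphabet are illustrative assumptions; the symbol weights 1, 0.5 and 0 for #, @ and _ follow the convention of [12] quoted above.

```python
# Seed letters as subsets of the alignment alphabet {1, h, 0}
SEED_ALPHABET = {"#": {"1"}, "@": {"1", "h"}, "_": {"1", "h", "0"}}
SYMBOL_WEIGHT = {"#": 1.0, "@": 0.5, "_": 0.0}

def hit_positions(seed, alignment):
    """1-based start positions at which `seed` matches `alignment`."""
    m = len(seed)
    return [
        i + 1
        for i in range(len(alignment) - m + 1)
        if all(a in SEED_ALPHABET[b] for b, a in zip(seed, alignment[i:i + m]))
    ]

def hash_weight(seed):
    """The #-weight: number of must-match positions."""
    return seed.count("#")

def seed_weight(seed):
    """Selectivity-oriented weight, with @ counted as half a match."""
    return sum(SYMBOL_WEIGHT[b] for b in seed)

print(hit_positions("#@_#", "10h1h1101"))  # [4, 6]
print(hash_weight("#@_#"), seed_weight("#@_#"))  # 2 2.5
```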



3.2 Subset Seed Automaton 



Let us fix an alignment alphabet $A$, a seed alphabet $B$, and a seed
$\pi = \pi_1 \pi_2 \ldots \pi_m \in B^m$ of span $m$ and #-weight $w$. Let $R_\pi$ be the set of all
non-# positions in $\pi$, $|R_\pi| = r = m - w$. We now define an automaton
$S_\pi = \langle Q, q_0, Q_f, A, \psi : Q \times A \to Q \rangle$ that recognizes the set of all alignments matched by $\pi$.

The states $Q$ of $S_\pi$ are pairs $\langle X, t \rangle$ such that $X \subseteq R_\pi$, $t \in [0, \ldots, m]$, with
the following invariant condition. Suppose that $S_\pi$ has read a prefix $s_1 \ldots s_p$ of
an alignment $s$ and has come to a state $\langle X, t \rangle$. Then $t$ is the length of the
longest suffix of $s_1 \ldots s_p$ of the form $1^i$, $i \le m$, and $X$ contains all positions
$x_i \in R_\pi$ such that prefix $\pi_1 \cdots \pi_{x_i}$ of $\pi$ matches a suffix of $s_1 \cdots s_{p-t}$.

(a) $\pi$ = #@#_##_###
(b) $s$ = 111h1011h11...
(c) $\pi_{1..2}$ = #@, $\pi_{1..4}$ = #@#_, $\pi_{1..7}$ = #@#_##_

Fig. 1. Illustration to Example 2

Example 2. In the framework of Example 1, consider a seed $\pi$ and an alignment
prefix $s$ of length $p = 11$ given in Figure 1(a) and 1(b) respectively. The length $t$
of the last run of 1's of $s$ is 2. The last mismatch position of $s$ is $s_9 = h$. The set
$R_\pi$ of non-# positions of $\pi$ is $\{2, 4, 7\}$ and $\pi$ has 3 prefixes ending at positions
of $R_\pi$ (Figure 1(c)). Prefixes $\pi_{1..2}$ and $\pi_{1..7}$ do match suffixes of $s_1 s_2 \ldots s_9$, and
prefix $\pi_{1..4}$ does not. Thus, the state of the automaton after reading $s_1 s_2 \ldots s_{11}$
is $\langle \{2, 7\}, 2 \rangle$.
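The state reached in Example 2 can be recomputed directly from the invariant condition, without building the automaton. The following Python sketch does so; the function name and the encoding of seed letters are illustrative assumptions, and the cap $i \le m$ on the run of 1's is ignored because it only concerns states that are already final.

```python
SEED_ALPHABET = {"#": {"1"}, "@": {"1", "h"}, "_": {"1", "h", "0"}}

def state_from_invariant(seed, prefix):
    """Return (X, t) for the alignment prefix read so far."""
    m = len(seed)
    # t: length of the last run of 1's in the prefix
    t = 0
    while t < len(prefix) and prefix[-1 - t] == "1":
        t += 1
    head = prefix[:len(prefix) - t]  # s_1 ... s_{p-t}
    non_hash = [x for x in range(1, m + 1) if seed[x - 1] != "#"]  # R_pi
    # X: positions x of R_pi whose seed prefix pi_1..pi_x matches a suffix of head
    X = {
        x for x in non_hash
        if x <= len(head)
        and all(a in SEED_ALPHABET[b] for b, a in zip(seed[:x], head[-x:]))
    }
    return X, t

print(state_from_invariant("#@#_##_###", "111h1011h11"))  # ({2, 7}, 2)
```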
The initial state $q_0$ of $S_\pi$ is the state $\langle \emptyset, 0 \rangle$. The final states $Q_f$ of $S_\pi$ are
all states $q = \langle X, t \rangle$, where $\max\{X\} + t = m$. All final states are merged into
one state.

The transition function $\psi(q, a)$ is defined as follows: If $q$ is a final state, then
$\forall a \in A$, $\psi(q, a) = q$. If $q = \langle X, t \rangle$ is a non-final state, then

— if $a = 1$ then $\psi(q, a) = \langle X, t + 1 \rangle$,

— otherwise $\psi(q, a) = \langle X_U \cup X_V, 0 \rangle$ with

• $X_U = \{x \mid x \le t + 1 \text{ and } a \in \pi_x\}$

• $X_V = \{x + t + 1 \mid x \in X \text{ and } a \in \pi_{x+t+1}\}$

Lemma 3. The automaton $S_\pi$ accepts the set of all alignments matched by $\pi$.

Proof. It can be verified by induction that the invariant condition on the states
$\langle X, t \rangle \in Q$ is preserved by the transition function $\psi$. The final states verify
$\max\{X\} + t = m$, which implies that $\pi$ matches a suffix of $s_1 \ldots s_p$.
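The construction of $S_\pi$ can be sketched in Python by generating only the states reachable from the initial state. This is a minimal illustration, not the authors' implementation: the names are hypothetical, the alphabets are those of Example 1, and the maximum of an empty set $X$ is assumed to be 0 so that a run of $m$ matches reaches the final state. The helper `matched` is a brute-force reference used to compare the automaton against the definition of a seed match.

```python
from itertools import product

ALPHABET = "1h0"
SEED_ALPHABET = {"#": {"1"}, "@": {"1", "h"}, "_": {"1", "h", "0"}}
FINAL = "final"  # all final states merged into one

def build_automaton(seed):
    m = len(seed)

    def in_letter(a, x):
        # does alignment letter a belong to the subset pi_x (x is 1-based)?
        return a in SEED_ALPHABET[seed[x - 1]]

    def state(X, t):
        # assumption: the maximum of the empty set is taken to be 0
        return FINAL if max(X, default=0) + t == m else (frozenset(X), t)

    start = state(set(), 0)
    delta, seen, todo = {}, {start}, [start]
    while todo:
        q = todo.pop()
        for a in ALPHABET:
            if q == FINAL:
                target = FINAL
            elif a == "1":
                X, t = q
                target = state(X, t + 1)
            else:
                X, t = q
                X_u = {x for x in range(1, t + 2) if in_letter(a, x)}
                X_v = {x + t + 1 for x in X if in_letter(a, x + t + 1)}
                target = state(X_u | X_v, 0)
            delta[(q, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, delta, seen

def accepts(start, delta, word):
    q = start
    for a in word:
        q = delta[(q, a)]
    return q == FINAL

def matched(seed, word):
    # brute-force reference: does the seed match the word at some position?
    m = len(seed)
    return any(
        all(a in SEED_ALPHABET[b] for b, a in zip(seed, word[i:i + m]))
        for i in range(len(word) - m + 1)
    )
```

Since the final state is absorbing, a word is accepted exactly when the seed has matched at some position of the word, which is what the brute-force reference tests.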

Lemma 4. The number of states of the automaton $S_\pi$ is no more than $(w + 1)2^r$.

Proof. Assume that $R_\pi = \{x_1, x_2, \ldots, x_r\}$ and $x_1 < x_2 < \cdots < x_r$. Let $Q_i$ be
the set of non-final states $\langle X, t \rangle$ with $\max\{X\} = x_i$, $i \in [1..r]$. For states
$q = \langle X, t \rangle \in Q_i$ there are $2^{i-1}$ possible values of $X$ and $m - x_i$ possible values
of $t$, as $\max\{X\} + t \le m - 1$. Thus, $|Q_i| \le 2^{i-1}(m - x_i) \le 2^{i-1}(m - i)$
and $\sum_{i=1}^{r} |Q_i| \le \sum_{i=1}^{r} 2^{i-1}(m - i) = (m - r + 1)2^r - m - 1$. Besides states
$Q_i$, $Q$ contains $m$ states $\langle \emptyset, t \rangle$ ($t \in [0..m - 1]$) and one final state. Thus,
$|Q| \le (m - r + 1)2^r = (w + 1)2^r$.
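The closed form of the sum used in the proof of Lemma 4 can be verified with the standard identity $\sum_{i=1}^{r} i\,2^{i-1} = (r-1)2^r + 1$. The following worked derivation uses the notation of the proof, with $m - r = w$:

```latex
\begin{align*}
\sum_{i=1}^{r} 2^{i-1}(m - i)
  &= m \sum_{i=1}^{r} 2^{i-1} - \sum_{i=1}^{r} i\,2^{i-1}
   = m\,(2^r - 1) - \bigl((r-1)\,2^r + 1\bigr) \\
  &= (m - r + 1)\,2^r - m - 1, \\
|Q| &\le (m - r + 1)\,2^r - m - 1 + m + 1 = (m - r + 1)\,2^r = (w + 1)\,2^r .
\end{align*}
```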

Note that if $\pi$ starts with #, which is always the case for ordinary spaced
seeds, then $x_i \ge i + 1$, $i \in [1..r]$, and the previous bound rewrites to $2^{i-1}(m - i - 1)$.
This results in the same number of states $w 2^r$ as for the Aho-Corasick automaton
[17]. The construction of automaton $S_\pi$ is optimal, in the sense that no two
states can be merged in general. A straightforward generation of the transition
table of the automaton $S_\pi$ can be performed in time $O(r \cdot w \cdot 2^r \cdot |A|)$. A more
complicated algorithm allows one to reduce the bound to $O(w \cdot 2^r \cdot |A|)$. In the
next section, we demonstrate experimentally that on average, our construction
yields a very compact automaton, close to the minimal one. Together with the
general approach of section 2, this provides a fast algorithm for computing the
sensitivity of subset seeds and, in turn, makes it possible to efficiently design
spaced seeds well adapted to the similarity search problem of interest.

4 Experiments 

Several types of experiments have been performed to test the practical
applicability of the results of sections 2 and 3. We focused on DNA similarity search, and
set the alignment alphabet $A$ to $\{1, h, 0\}$ (match, transition, transversion). For
subset seeds, the seed alphabet $B$ was set to $\{\#, @, \_\}$, where # = $\{1\}$, @ = $\{1, h\}$,
_ = $\{1, h, 0\}$ (see Example 1). The weight of a subset seed is computed by
assigning weights 1, 0.5 and 0 to symbols #, @ and _ respectively.
4.1 Size of the automaton

We compared the size of the automaton $S_\pi$ defined in section 3 and the
Aho-Corasick automaton [22], both for ordinary spaced seeds (binary seed alphabet)
and for subset seeds (ternary seed alphabet). The Aho-Corasick automaton for
spaced seeds was constructed as defined in [17]. For subset seeds, a
straightforward generalization was considered: the Aho-Corasick construction was
applied to the set of alignment fragments matched by the seed.

Tables 1(a) and 1(b) present the results for spaced seeds and subset seeds
respectively. For each seed weight $w$, we computed the average number of states
(avg. s.) of the Aho-Corasick automaton and of our automaton $S_\pi$, and reported
the corresponding ratio ($\delta$) with respect to the average number of states of the
minimized automaton. The average was computed over all seeds of span up to
$w + 8$ for spaced seeds and all seeds of span up to $w + 5$ with two @'s for subset
seeds. Interestingly, our automaton turns out to be more compact than the
Aho-Corasick automaton not only on non-binary alphabets (which was expected), but
also on the binary alphabet (cf. Table 1(a)). Note that for a given seed, one can
define a surjective mapping from the states of the Aho-Corasick automaton onto
the states of our automaton. This implies that our automaton never has more
states than the Aho-Corasick automaton.

Table 1. Comparison of the average number of states of the Aho-Corasick automaton,
the automaton $S_\pi$ of section 3, and the minimized automaton, for (a) spaced seeds
and (b) subset seeds

4.2 Seed Design 

In this part, we considered several probability transducers to design spaced or
subset seeds. The target alignments included all alignments of length 64 on
alphabet $\{1, h, 0\}$. Four probability transducers were studied (analogous to
those introduced in [23]):

— B: Bernoulli model,

— DT1: deterministic probability transducer specifying probabilities of $\{1, h, 0\}$
at each codon position (extension of the $M^{(3)}$ model of [23] to the three-letter
alphabet),

— DT2: deterministic probability transducer specifying probabilities of each of
the 27 codon instances $\{1, h, 0\}^3$ (extension of the $M^{(8)}$ model of [23]),

— NT: non-deterministic probability transducer combining four copies of DT2
specifying four distinct codon conservation levels (called HMM model in
[23]).
Table 2. Best seeds and their sensitivity for probability transducer B
Table 3. Best seeds and their sensitivity for probability transducer DT1



Models DT1, DT2 and NT have been trained on alignments resulting from a
pairwise comparison of 40 bacterial genomes. For each of the four probability
transducers, we computed the best seed of weight $w$ ($w$ = 9, 10, 11, 12) among
two categories: ordinary spaced seeds of weight $w$ and subset seeds of weight $w$
with two @. Ordinary spaced seeds were enumerated exhaustively up to a given
span, and for each seed, the sensitivity was computed using the algorithmic
approach of section 2 and the seed automaton construction of section 3. Each
such computation took between 10 and 500 ms on a Pentium IV 2.4 GHz computer,
depending on the seed weight/span and the model used. In each experiment, the
most sensitive seed found was kept. The results are presented in Tables 2-5.

In all cases, subset seeds yield a better sensitivity than ordinary spaced seeds. 
The sensitivity increment varies up to 0.04 which is a notable increase. As shown 
in [12], the gain in using subset seeds increases substantially when the transition 
probability is greater than the transversion probability, which is very often the 
case in related genomes. 



5 Discussion 

We introduced a general framework for computing the seed sensitivity for various
similarity search settings. The approach can be seen as a generalization of the
methods of [17,15] in that it yields algorithms with the same worst-case
complexity bounds as those proposed in these papers, but also yields
efficient algorithms for new formulations of the seed sensitivity problem. This
versatility is achieved by distinguishing and treating separately the three
ingredients of the seed sensitivity problem: a set of target alignments, an associated
probability distribution, and a seed model.

Table 4. Best seeds and their sensitivity for probability transducer DT2
Table 5. Best seeds and their sensitivity for probability transducer NT

We then studied a new concept of subset seeds, which represents an
interesting compromise between the efficiency of spaced seeds and the flexibility of
vector seeds. For this type of seeds, we defined an automaton with $O(w 2^r)$ states
regardless of the size of the alignment alphabet, and showed that its transition
table can be constructed in time $O(w 2^r |A|)$. Projected to the case of spaced
seeds, this construction gives the same worst-case bound as the Aho-Corasick
automaton of [17], but results in a smaller number of states in practice. The different
experiments we have done confirm the practical efficiency of the whole method,
both at the level of computing sensitivity for designing good seeds, and at the
level of using those seeds for DNA similarity search.

As far as the future work is concerned, it would be interesting to study the 
design of efficient spaced seeds for protein sequence search (see [10]), as well as 
to combine spaced seeds with other techniques such as seed families [5,6,8] or 
the group hit criterion [12]. 